Grouping Data
Grouping data is an essential task for data analysis, allowing us to understand and manipulate data at a group level. Pandas provides the groupby() function to facilitate this, implementing the split-apply-combine pattern. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results.
Splitting
To illustrate how groupby() works, let's use a dataset and go through an example step by step.
Importing Libraries and Loading Data
import pandas as pd
import numpy as np
# Load the US census data
df = pd.read_csv('datasets/census.csv')
# Exclude state-level summarizations
df = df[df['SUMLEV'] == 50]
df.head()
Grouping by State
First, let's calculate the average population for each state using a for loop and timing the operation.
%%timeit -n 3
for state in df['STNAME'].unique():
avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
print('Counties in state ' + state + ' have an average population of ' + str(avg))
Now, let's use groupby() to achieve the same result more efficiently.
%%timeit -n 3
for group, frame in df.groupby('STNAME'):
avg = np.average(frame['CENSUS2010POP'])
print('Counties in state ' + group + ' have an average population of ' + str(avg))
Using Functions for Grouping
We can also group data by using custom functions. For example, we can group states into batches based on the first letter of the state name.
df = df.set_index('STNAME')
def set_batch_number(item):
if item[0] < 'M':
return 0
if item[0] < 'Q':
return 1
return 2
for group, frame in df.groupby(set_batch_number):
print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
Grouping with Multi-Index
For more complex grouping, we can use a multi-index. Let's group Airbnb data by cancellation policy and review scores.
df = pd.read_csv("datasets/listings.csv")
df = df.set_index(["cancellation_policy", "review_scores_value"])
for group, frame in df.groupby(level=(0, 1)):
print(group)
We can also use a custom function to group the data.
def grouping_fun(item):
if item[1] == 10.0:
return (item[0], "10.0")
else:
return (item[0], "not 10.0")
for group, frame in df.groupby(by=grouping_fun):
print(group)
Applying Functions to Groups
Aggregation
Aggregation involves computing a summary statistic (e.g., mean, sum) for each group.
df = df.reset_index()
df.groupby("cancellation_policy").agg({"review_scores_value": np.nanmean})
df.groupby("cancellation_policy").agg({
"review_scores_value": (np.nanmean, np.nanstd),
"reviews_per_month": np.nanmean
})
Transformation
Transformation applies a function to each group, returning an object of the same size.
cols = ['cancellation_policy', 'review_scores_value']
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)
df = df.merge(transform_df, left_index=True, right_index=True)
df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_scores'])
df.head()
Filtering
Filtering removes entire groups based on a condition.
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)
Applying Custom Functions
The apply() function allows applying an arbitrary function to each group and combining the results.
df = pd.read_csv("datasets/listings.csv")
df = df[['cancellation_policy', 'review_scores_value']]
def calc_mean_review_scores(group):
avg = np.nanmean(group["review_scores_value"])
group["review_scores_mean"] = np.abs(avg - group["review_scores_value"])
return group
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()